hxl tag
SMUTF: Schema Matching Using Generative Tags and Hybrid Features
Zhang, Yu; Di, Mei; Luo, Haozheng; Xu, Chenwei; Tsai, Richard Tzong-Han
We introduce SMUTF, a unique approach for large-scale tabular data schema matching (SM), which assumes that supervised learning does not affect performance in open-domain tasks, thereby enabling effective cross-domain matching. The system combines rule-based feature engineering, pre-trained language models, and generative large language models. In an adaptation inspired by the Humanitarian Exchange Language, we deploy 'generative tags' for each data column, enhancing the effectiveness of SM. SMUTF is versatile and works with any pre-existing pre-trained embeddings, classification methods, and generative models. Recognizing the lack of extensive, publicly available datasets for SM, we have created and open-sourced the HDXSM dataset from public humanitarian data. We believe this to be the most exhaustive SM dataset currently available. In evaluations across various public datasets and the novel HDXSM dataset, SMUTF demonstrated exceptional performance, surpassing existing state-of-the-art models in terms of accuracy and efficiency, improving the F1 score by 11.84% and the AUC of ROC by 5.08%.
Predicting Metadata for Humanitarian Datasets Using GPT-3
Responding to humanitarian disasters quickly or, better still, anticipating them can save lives [1]. Data is key to this: not just having lots of data, but having clean, well-understood data [2] that gives a clear view of the situation on the ground. In many cases this critical data is stored in hundreds of small spreadsheets, so piecing them all together can be time-consuming, and the result is difficult to maintain as new data comes in during a humanitarian incident. Automating the process of data discovery could speed responses and improve outcomes for affected people. One way to make discovery easier is to ensure that tabular data has metadata describing each column.
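To make the idea of per-column metadata concrete, here is a minimal Python sketch of how a spreadsheet tagged in the Humanitarian Exchange Language (HXL) style might be read. It assumes the common HXL layout in which a row of hashtags (such as `#adm1 +name`) sits directly beneath the human-readable header row. The function name `read_hxl_columns` and the sample data are hypothetical and exist only for illustration; they are not taken from either work above.

```python
import csv
import io


def read_hxl_columns(csv_text):
    """Return (header, hxl_tag) pairs from HXL-style tagged CSV text.

    Assumption: row 1 holds human-readable headers and row 2 holds
    HXL hashtags (cells that start with '#').
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if len(rows) < 2:
        raise ValueError("expected a header row and an HXL tag row")
    headers, tags = rows[0], rows[1]
    # Reject files whose second row contains no hashtag at all.
    if not any(cell.strip().startswith("#") for cell in tags):
        raise ValueError("second row does not look like an HXL tag row")
    return [(h.strip(), t.strip()) for h, t in zip(headers, tags)]


# Illustrative sample data (invented for this sketch).
SAMPLE = """Province,People affected,Report date
#adm1 +name,#affected,#date +reported
Coast,1200,2023-01-15
"""

columns = read_hxl_columns(SAMPLE)
for header, tag in columns:
    print(f"{header!r} is described by {tag!r}")
```

With tags like these in place, two spreadsheets that label a column "Province" and "Region name" respectively can still be recognized as holding the same kind of data, provided both carry the same hashtag.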